Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Abstract
Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method carries a hidden cost: scholars work with digitally processed historical documents, and it is unclear how much trust they can place in noisy representations of the source texts. In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. Such an estimate is important for assessing whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and for the data required for uncertainty estimation. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current state of knowledge on the users' side, as well as on the tool makers' and data providers' side, is insufficient and needs to be improved.